This document describes how to visualize metagenomic bins and build a pangenome in anvio.

Anvio is a platform for analyzing and visualizing metagenomic data and for building pangenomes. It runs in a dedicated conda environment, which must be activated first.

conda activate anvio-7.1

Get bin info

Convert the bin information into a format that anvio can import. For each binning method, concatenate the per-bin files into a single two-column, tab-separated file that lists which contig or read goes in which bin.

library(stringr)  # provides str_replace()

# get all bin directories; list.dirs() is recursive, so path[1] is ../data/Bins itself
path <- list.dirs("../data/Bins")

# loop over each binning method (path[2] to path[7])
for (i in 2:7){

  DF <- NULL

  pathname <- path[i]
  filelist <- list.files(pathname)

  # get list of all contigs and reads for all bins into 1 tsv file
  for (filename in filelist){
    # each bin file holds one contig or read name per line
    df <- read.csv(file.path(pathname, filename), header = FALSE)
    colnames(df) <- "read"
    # bin name is the file name with the first "." replaced by "_"
    df$bin <- str_replace(filename, "[.]", "_")

    # change names if a number is at the beginning of the bin name
    if (basename(pathname) == "24_sample_bam_bins"){
      df$bin <- str_replace(df$bin, "24", "twentyfour")
    }
    if (basename(pathname) == "47_sample_bam_bins"){
      df$bin <- str_replace(df$bin, "47", "fortyseven")
    }

    DF <- rbind(DF, df)
  }

  # write one headerless, tab-separated file per binning method
  write.table(DF, paste0("../output/all_bins/", basename(pathname), ".tsv"), row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")

}

Examine the output TSV files. They were written without a header row, so the column names are set when reading.

library(knitr)  # provides kable()

tsv_output <- read.csv("../output/all_bins/assembly_bins.tsv", sep = "\t", header = FALSE, col.names = c("read", "bin"))
kable(head(tsv_output, 6))
read                       bin
MG1058_s821.ctg000852l     assembly_bin_1
MG1058_s1105.ctg001148l    assembly_bin_1
MG1058_s1585.ctg001645l    assembly_bin_1
MG1058_s1820.ctg001893l    assembly_bin_1
MG1058_s645.ctg000674l     assembly_bin_10
MG1058_s914.ctg000951l     assembly_bin_10
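
Before importing, it can help to check each TSV for lines that might be rejected. The sketch below is a hypothetical helper, not part of anvio. It assumes that every line should have exactly two tab-separated fields and that a bin name should contain only letters, digits, and underscores and should not start with a digit (the reason the R code above spells out 24 and 47).

```shell
# Hypothetical helper: list lines of a collection TSV that look problematic.
# Assumed rules (not taken from the anvio documentation): two tab-separated
# fields per line; bin names use letters, digits, and "_" and do not start
# with a digit. Returns non-zero if any line is flagged.
check_collection_tsv() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    NF != 2 {
      printf "line %d: expected 2 fields, found %d\n", NR, NF
      bad = 1
      next
    }
    $2 !~ /^[A-Za-z_][A-Za-z0-9_]*$/ {
      printf "line %d: suspicious bin name \"%s\"\n", NR, $2
      bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

For example, `check_collection_tsv ../output/all_bins/assembly_bins.tsv` prints nothing and returns success when every line passes.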

Import bins into anvio

Import the bin assignments as a collection into the anvio profile and contigs databases that were already created.

# Example import for one binning method; change the input TSV and the collection name (-C) for each method
anvi-import-collection "./github/jordan-marinimicrobia/output/all_bins/short_reads_bam_bins.tsv" \
                       -p "./Downloads/plus_PROFILE.db" \
                       -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_plus/1058_P1_2018_585_0.2um_assembly_plus.db" \
                       --contigs-mode \
                       -C shortreads
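
Since the import is repeated for every binning method, a small loop can save retyping. This is a minimal sketch under stated assumptions: the `run` and `import_collections` helpers are hypothetical, the second TSV/collection pair is made up for illustration (only `short_reads_bam_bins.tsv` with `shortreads` comes from the example above), and the flags are the ones already used above. By default it only prints the commands.

```shell
# Preview by default; set DRY_RUN=0 to actually execute the commands.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = profile db, $2 = contigs db, $3 = directory holding the TSV files
import_collections() {
  while read -r tsv collection; do
    # </dev/null keeps the command from consuming the list being read
    run anvi-import-collection "$3/$tsv" -p "$1" -c "$2" \
        --contigs-mode -C "$collection" </dev/null
  done <<'EOF'
short_reads_bam_bins.tsv shortreads
assembly_bins.tsv assembly
EOF
}
```

Calling `import_collections plus_PROFILE.db contigs.db output/all_bins` (placeholder paths) prints one command per listed method for review before running with `DRY_RUN=0`.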

Run interactive browser

Use the interactive browser to visualize the metagenome and bins.

anvi-interactive -p "./Downloads/plus_PROFILE.db" \
                 -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_plus/1058_P1_2018_585_0.2um_assembly_plus.db"

This is an example of what the interactive browser looks like with bins. Anvio calculates statistics like completion and redundancy for each bin.

Anvio interactive browser with bins

Refine bins

Inspect “contaminated” bins to see how and why they are contaminated. Note that bin names in the anvio collection differ from the original bin file names: the first “.” is replaced with “_”, and 24 and 47 are spelled out as “twentyfour” and “fortyseven”.

anvi-refine -p "./Downloads/assembly_PROFILE.db" \
            -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_only/1058_P1_2018_585_0.2um_assembly.db" \
            -C shortreads \
            -b short_reads_bam_bin_163
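
To find the name to pass to -b for a given bin file, the renaming done in the R code can be mirrored in the shell. This is a sketch under assumptions: `to_anvio_bin_name` is a hypothetical helper, and it assumes original bin files are named like `short_reads_bam_bin.163`, with 24 or 47 appearing only at the start of the name.

```shell
# Hypothetical helper: convert an original bin file name to the bin name
# stored in the anvio collection. Mirrors the R renaming: the first "." becomes
# "_", and a leading 24 or 47 is spelled out.
to_anvio_bin_name() {
  printf '%s\n' "$1" | sed -e 's/\./_/' \
                           -e 's/^24/twentyfour/' \
                           -e 's/^47/fortyseven/'
}
```

For example, `to_anvio_bin_name short_reads_bam_bin.163` gives `short_reads_bam_bin_163`, the name used in the anvi-refine call above.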

Example of a contaminated bin. Coverage is inconsistent across the bin, the clustering tree that anvio uses to group sequences splits into many branches, and many single-copy core genes are duplicated.

Anvio interactive display of a contaminated bin

Get summary statistics

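I used anvi-interactive to get a summary of all bins in each bin collection. The example output file below reports, for each bin, its total length, number of contigs, N50, GC content, percent completion, percent redundancy, and taxonomy.

# named bin_summary so the variable does not mask the summary() function
bin_summary <- read.table("../output/anvio_outputs/assembly_plus_summary.txt", sep = "\t", header = TRUE)
summary(bin_summary)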
##      bins            total_length       num_contigs           N50         
##  Length:274         Min.   :  202581   Min.   :   1.00   Min.   :  10203  
##  Class :character   1st Qu.:  310692   1st Qu.:   9.00   1st Qu.:  12386  
##  Mode  :character   Median :  488296   Median :  23.00   Median :  14620  
##                     Mean   :  879731   Mean   :  52.00   Mean   :  74294  
##                     3rd Qu.:  896170   3rd Qu.:  50.75   3rd Qu.:  50200  
##                     Max.   :20079188   Max.   :1582.00   Max.   :3034959  
##    GC_content    percent_completion percent_redundancy   t_domain        
##  Min.   :26.47   Min.   :  0.00     Min.   :   0.00    Length:274        
##  1st Qu.:38.99   1st Qu.:  0.00     1st Qu.:   0.00    Class :character  
##  Median :45.84   Median :  0.00     Median :   0.00    Mode  :character  
##  Mean   :47.44   Mean   : 13.16     Mean   :  16.19                      
##  3rd Qu.:57.02   3rd Qu.: 23.59     3rd Qu.:   0.00                      
##  Max.   :69.50   Max.   :100.00     Max.   :2053.52                      
##    t_phylum           t_class            t_order            t_family        
##  Length:274         Length:274         Length:274         Length:274        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    t_genus           t_species        
##  Length:274         Length:274        
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
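
The median completion of 0 shows that most bins are far from complete, so a quick filter helps pick out bins worth a closer look. The sketch below is a hypothetical helper that assumes the summary file is tab-separated with the column names shown above (`bins`, `percent_completion`, `percent_redundancy`). The thresholds in the usage note are illustrative only, not values from this analysis.

```shell
# Hypothetical helper: print the names of bins that meet both cutoffs.
# $1 = summary file, $2 = minimum percent completion,
# $3 = maximum percent redundancy
filter_bins() {
  awk -F'\t' -v minc="$2" -v maxr="$3" '
    # map header names to column positions
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    ($(col["percent_completion"]) + 0 >= minc + 0) &&
    ($(col["percent_redundancy"]) + 0 <= maxr + 0) {
      print $(col["bins"])
    }
  ' "$1"
}
```

For example, `filter_bins ../output/anvio_outputs/assembly_plus_summary.txt 70 10` would list bins with at least 70% completion and at most 10% redundancy.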

Create pangenome

I built the pangenome from the assembly plus bins because they show lower contamination while still containing more contigs than the assembly-only bins. In addition to my bins, I included three complete genomes from cultured Sulfitobacter pontiacus.

I created anvio contigs databases (dbs) for my bins and annotated them with HMMs, single-copy core gene taxonomy, tRNAs, NCBI COGs, and KEGG KOfams. I then used the interactive display to visualize the pangenome and to find variable regions of the Sulfitobacter genome, which the next markdown document builds on. The commands below show one bin as an example.

anvi-gen-contigs-database -f sulf_genomes/assembly_plus_bin_4.fa -o sulf_genomes/dbs/sulfbin4.db

anvi-run-hmms -c sulf_genomes/dbs/sulfbin4.db
anvi-run-scg-taxonomy -c sulf_genomes/dbs/sulfbin4.db
anvi-scan-trnas -c sulf_genomes/dbs/sulfbin4.db
anvi-run-ncbi-cogs -c sulf_genomes/dbs/sulfbin4.db
anvi-run-kegg-kofams -c sulf_genomes/dbs/sulfbin4.db

anvi-gen-genomes-storage -e sulf-external-genomes.txt \
                         -o sulf-GENOMES.db

anvi-pan-genome -g sulf-GENOMES.db -n sulfitobacter

anvi-display-pan -g sulf-GENOMES.db -p sulfitobacter/sulfitobacter-PAN.db
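
The file passed to -e above is a tab-separated table with a `name` column and a `contigs_db_path` column, one row per genome. A minimal sketch for generating it from a directory of contigs databases follows. `make_external_genomes` is a hypothetical helper, and using the database file name as the genome name is an assumption for illustration; names should stay simple (letters, digits, underscores).

```shell
# Hypothetical helper: write an external-genomes table to stdout.
# $1 = directory containing one contigs database (*.db) per genome
make_external_genomes() {
  printf 'name\tcontigs_db_path\n'
  for db in "$1"/*.db; do
    [ -e "$db" ] || continue            # skip if the directory has no .db files
    name=$(basename "$db" .db)          # e.g. sulfbin4.db -> sulfbin4
    printf '%s\t%s\n' "$name" "$db"
  done
}
```

For example, `make_external_genomes sulf_genomes/dbs > sulf-external-genomes.txt` would produce the input for anvi-gen-genomes-storage, assuming that directory holds only the contigs databases meant for the pangenome.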

Pangenome visualization. The variable regions are labeled, and sulf 1, 2, and 3 are the cultured genomes.

Sulfitobacter pangenome